Computer-Assisted Keyword and Document Set Discovery from Unstructured Text

نویسندگان

Gary King

Patrick Lam

Margaret E. Roberts

چکیده

The (unheralded) first step in many applications of automated text analysis involves selecting keywords to choose documents from a large text corpus for further study. Although all substantive results depend crucially on this choice, researchers typically pick keywords in ad hoc ways, given the lack of formal statistical methods to help. Paradoxically, this often means that the validity of the most sophisticated text analysis methods depends in practice on the inadequate keyword counting or matching methods they are designed to replace. The same ad hoc keyword selection process is also used in many other areas, such as following conversations that rapidly innovate language to evade authorities, seek political advantage, or express creativity; generic web searching; eDiscovery; look-alike modeling; intelligence analysis; and sentiment and topic analysis. We develop a computer-assisted (as opposed to fully automated) statistical approach that suggests keywords from available text, without needing any structured data as inputs. This framing poses the statistical problem in a new way, which leads to a widely applicable algorithm. Our specific approach is based on training classifiers, extracting information from (rather than correcting) their mistakes, and then summarizing results with Boolean search strings. We illustrate how the technique works with examples in English and Chinese. ∗Our thanks to Dan Gilbert for helpful suggestions. †Institute for Quantitative Social Science, 1737 Cambridge Street, Harvard University, Cambridge MA 02138; GKing.harvard.edu, [email protected], (617) 500-7570. ‡Institute for Quantitative Social Science, 1737 Cambridge Street, Harvard University, Cambridge, MA 02138; www.patricklam.org §Department of Government, 1737 Cambridge Street, Harvard University, Cambridge MA 02138; scholar.harvard.edu/mroberts

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Keyword-Based Browsing and Analysis of Large Document Sets

Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. This paper describes the KDT system for Kno...

متن کامل

Discriminative Features Selection in Text Mining Using TF - IDF Scheme

This paper describes technique for discriminative features selection in Text mining. 'Text mining’ is the discovery of new, previously unknown information, by computer. Discriminative features are the most important keywords or terms inside document collection which describe the informative news included in the document collection. Generated keyword set are used to discover Association Rules am...

متن کامل

Efficient Text and Semi-structured Data Mining: Knowledge Discovery in the Cyberspace

This paper describes applications of the optimized pattern discovery framework to text and Web mining. In particular, we introduce a class of simple combinatorial patterns over texts such as proximity phrase association patterns and ordered and unordered tree patterns modeling unstructured texts and semi-structured data on the Web. Then, we consider the problem of finding the patterns that opti...

متن کامل

خوشه‌بندی اسناد مبتنی بر آنتولوژی و رویکرد فازی

Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...

متن کامل

Modified Approach of Multinomial Naïve Bayes for Text Document Classification

This work proposes a text classification using modified approach of Multinomial Naïve Bayes for justifying and identifying the documents into a particular category. Due to the exploration of the textual information from the electronic digital documents as well as World Wide Web. Naïve Bayes theorem is effective for classification of text documents into the predefined categories by means of the ...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2014

Computer-Assisted Keyword and Document Set Discovery from Unstructured Text

نویسندگان

چکیده

منابع مشابه

Keyword-Based Browsing and Analysis of Large Document Sets

Discriminative Features Selection in Text Mining Using TF - IDF Scheme

Efficient Text and Semi-structured Data Mining: Knowledge Discovery in the Cyberspace

خوشه‌بندی اسناد مبتنی بر آنتولوژی و رویکرد فازی

Modified Approach of Multinomial Naïve Bayes for Text Document Classification

عنوان ژورنال:

اشتراک گذاری